Week 3 Lecture

Species Distribution Modelling

Russell Dinnage

What are SDMs

  • Sometimes called Environmental Niche Models (ENMs, I prefer SDM)
  • Possible the most well-developed application of classic Data Science and Machine Learning (ML) in Biology
  • They take in data on the environment at different sites or parts of a landscape and try predict where (or when, or how many) organisms will occur there. Data on the species themselves can also be used
  • Traditionally done on single species at a time, but ‘joint’ species distribution models are increasingly common

SDMs as Data Science

  • To make an SDM you use a Data Science workflow
  • The final model used to predict species occurrence is fit using Machine Learning techniques
  • A Data Science workflow has two major components:
    • Data
    • a Model

Data and Models

  • Data Processing Steps

  • Model Building

  • Data feeds into Models

  • Models inform Data processing steps

Data Science Steps

  1. A Question
  2. Collect Data
  3. Munge / Clean Data
  4. Transform Data for Model
  5. Analyze Data using Model
  6. Tune Model
  7. Validate / Test Model
  8. Interpret Model

Data Science Steps

  1. A Question
  2. Collect Data
  3. Munge / Clean Data
  4. Transform Data for Model
  5. Analyze Data using Model
  6. Tune Model
  7. Validate / Test Model
  8. Interpret Model
  • We have covered some of this already
  • We will cover the rest in the next 4 weeks

What is a Model?

  • Encodes a relationship between a set of inputs and outputs
  • For an SDM this is a (potentially complex and nonlinear) function that takes environmental variables as input and its output is an occurrence pattern (could be abundance, probability, suitability, etc.)
  • Different Models differ in their assumptions, or ‘inductive biases’, and their parameters
  • Models are ‘fit’ to data
    • Parameters are chosen based on how well the Model’s output matches the real data
    • This is an optimization problem

The Simplest Model?

The Linear Model

  • Models outputs as a ‘linear’ function of inputs
  • Inputs are ‘predictor’ variables
  • Outputs are predictions, and they are compared to ‘response’ variable(s)
  • Assumes responses are a additive linear function of predictors

Fit a Linear Model to Reef Fish Data

Setup

library(sdmpack)
library(tidyverse)
library(tidymodels)

Fit a Linear Model to Reef Fish Data

  • Base R has a good collection of statistical methods including the linear model.
  • RF_abund has data on the abundance of different reef species at different reefs around the world
data("RF_abund")
RF_abund

Fit a Linear Model to Reef Fish Data

  • First let’s choose a fish species to model
  • Thalassoma pavo: Ornate Wrasse

fish_dat <- RF_abund %>%
  filter(SpeciesName == "Thalassoma pavo")

Fit a Linear Model to Reef Fish Data

  • lm() is the basic linear model function in R.
  • Let’s do a linear model on our fish species with temperature as a predictor
mod <- lm(
  AbundanceAdult40 ~ MeanTemp_CoralWatch, 
  data = fish_dat)

R ‘formulas’

mod <- lm(
  AbundanceAdult40 ~ MeanTemp_CoralWatch, 
  data = fish_dat)
  • A formula is a special data structure in R
  • It is specified using an expression of this form: lhs ~ rhs, where lhs stands for Left Hand Side, and rhs stands for Right Hand Side.
  • Formulas compactly express a relationship between variables: lhs contains ‘response’ variables that we wish to model as a function of the variables on the rhs: the ‘predictor’ variable
  • Both lhs and rhs can contain multiple variables and function calls (we will see examples of that later on)
  • Though a formula can be missing lhs, they must always have a ~ and a rhs (~ rhs is a valid formula)

summary(mod)

Fit a Linear Model to Reef Fish Data

ggplot(fish_dat, aes(x = MeanTemp_CoralWatch, y = AbundanceAdult40)) +
  geom_smooth(method = lm, se = FALSE, color = "red") + geom_point() + theme_minimal()

It is much easier to tell if the model is useful at all by visualizing.

Fit a Linear Model to Reef Fish Data

  • This model is not a good fit, and there might be some other problems
    • Models have assumptions
    • Does our data fit the assumptions?
  • I won’t go through all linear model assumption, but one is that the distribution of the data is not too extreme (skewed, or with long tails)

Fit a Linear Model to Reef Fish Data

Distribution of Abundance:

 hist(fish_dat$AbundanceAdult40, breaks = 100)

Fit a Linear Model to Reef Fish Data

  • One way to ‘fix’ the data is to transform it, for example:
 hist(log(fish_dat$AbundanceAdult40), breaks = 50)

However the main problem seems to stem from the fact that a line is not a good description on the response. Therefore, most SDM methods that are actually used allow for non-linear relationships. In this case (and many cases), some kind of hump shaped function might make sense.

tidymodels

  • tidymodels is an R package for statistical and machine learning models
  • It shares the philosophy of programming and data analysis with tidyverse
  • It has functions that mirror the step of Data Science that I presented earlier
  • Like tidyverse it is a meta-package, bundling several other packages together

parsnip: a tidymodels package to fit models

  • parsnip separates the model ‘structure’ from the implementation
  • It unifies inputs and outputs, so that it is simple to switch between different modelling methods without writing new code

Fit a Linear Model with parsnip

mod_pars <- linear_reg(engine = "lm")
mod_pars

Fit a Linear Model with parsnip

  • A model is fit using the fit() function
  • parsnip is designed work with pipe operators (%>% or |>)
mod_pars <- linear_reg(engine = "lm") %>%
  fit(AbundanceAdult40 ~ MeanTemp_CoralWatch,
      data = fish_dat)
mod_pars

Fit a Linear Model with parsnip

  • Parameters are extracted with the tidy() function.
mod_summ <- tidy(mod_pars)
mod_summ

Fit a Linear Model with parsnip

  • Visualize the fit by prediction
preds <- predict(mod_pars, newdata = fish_dat)
preds
pred_dat <- fish_dat %>%
  bind_cols(preds)

p <- ggplot(pred_dat, aes(x = MeanTemp_CoralWatch, y = .pred)) +
  geom_line() + geom_point(aes(y = AbundanceAdult40)) + 
  theme_minimal()
suppressMessages(print(p))

p <- ggplot(pred_dat, aes(x = MeanTemp_CoralWatch, y = .pred)) +
  geom_line() + geom_point(aes(y = AbundanceAdult40)) + 
  scale_y_continuous(trans = "log1p") + theme_minimal()
suppressMessages(print(p))

The power of parsnip

  • Now that the model is specified in parsnip, it is easy to change our model to one that can model nonlinear relationships easily. Let’s try doing a gradient boosted decision tree (you don’t need to know what that is for now).
mod_pars2 <- boost_tree(mode = "regression") %>%
  fit(AbundanceAdult40 ~ MeanTemp_CoralWatch,
      data = fish_dat)
mod_pars2

The power of parsnip

preds2 <- predict(mod_pars2, newdata = fish_dat)
pred_dat2 <- fish_dat %>%
  bind_cols(preds2) 

The power of parsnip

p <- ggplot(pred_dat2, aes(x = MeanTemp_CoralWatch, y = .pred)) +
  geom_line() + geom_point(aes(y = AbundanceAdult40)) + 
  scale_y_continuous(trans = "log1p") + theme_minimal()
suppressMessages(print(p))

Overfitting

An important concept in Data Science and Machine Learning is the idea of overfitting. The above model appears to ‘overfit’ the data – its predictions jump around wildly to try and fit each individual data point. But this in not likely to generalize well if applied to a new dataset. We reduce overfitting by tuning the ‘hyper-parameters’ of the algorithm to make it produce ‘smoother’ predictions. Smoothed prediction are more likely to generalize better to new datasets. This is accomplished using a method called cross-validation, which we will cover in depth in Week 7.

Forests of the Future or Forests of the Past?

Introducing the Pine Rocklands of South Florida

  • Unique to South Florida
  • High species diversity
  • Open canopy forests dominated by Slash Pine
  • Fire dependent ecosystem
  • One of the most critically endanger ecosystems in the world
  • Why?
    • Pine Rocklands central range is in the middle of Miami!
    • The only large patch remaining is in Everglades National Park

Image citation:

Trotta, Lauren B., et al. “Community phylogeny of the globally critically imperiled pine rockland ecosystem.” American journal of botany 105.10 (2018): 1735-1747.

FIU’s Pine Rockland

  • The FIU Preserve on campus has one of the few remaining patches of Pine Rockland in Miami-Dade
  • In 2016 FIU conducted a prescribed burn to help restore this ecosystem of local importance

Project Proposal

  • I propose we do one big group project on the Pine Rocklands in weeks 11-16
  • We will start here in lecture – I will use data on plants in the Pine Rockland for my examples in the next few weeks.
  • For the final project, each of you can choose a species from the Pine Rocklands (or a few species) to run a SDM on and use it to make predictions. What is the past and future of Pine Rockland species?
  • You will all follow the same steps of good Data Science
  • You can work in groups or on your own, it is your decision
  • At the end of the semester we will combine everyone’s SDMs into a prediction for the whole Pine Rockland community (or at least 18-30 species worth)

We will start by extracting data from this:

  • I will start using that data for example in class next week.
  • Next class we will be working on the Week 5 Assignment (you can start today).
  • Next week we will cover:
    • How to go from abundance data to presence / absence (e.g. occurrences)
    • How to include data preprocessing and model fitting in one workflow using tidymodels
    • A little bit of ecological niche theory to understand what SDMs represent and why they are important